Skip to content

Handle duplicate FINISHED messages, improve driver diagnostics#54

Open
travisdowns wants to merge 2 commits into0.12.x-redpandafrom
td-duplicate-finished-handling
Open

Handle duplicate FINISHED messages, improve driver diagnostics#54
travisdowns wants to merge 2 commits into0.12.x-redpandafrom
td-duplicate-finished-handling

Conversation

@travisdowns
Copy link
Copy Markdown
Member

Summary

  • Handle duplicate FINISHED messages in _handle_finished by checking active_tests and finished_tests before processing. Logs elapsed time since original and returns early instead of crashing with KeyError.
  • Track last 100 received messages in a ring buffer, dumped on fatal exception for post-mortem diagnosis.
  • Add FATAL_ERROR_TAG and SLOW_EVENT_THRESHOLD_S (5s) logging to the main event loop.
  • Increase Sender.REQUEST_TIMEOUT_MS from 3s to 35s so the first send attempt outlasts the 30s _join_test_process timeout. Non-FINISHED events are not idempotent, so retries can cause duplicate processing.

Depends on #53.

Test plan

  • All 18 runner tests pass (check_runner.py + check_sender_receiver.py)

The driver's _handle_finished calls _join_test_process inline, which
blocks the single-threaded ZMQ REP loop for up to 30s. During that
block, clients retry their FINISHED messages, which queue up on the
REP socket. When the driver unblocks, duplicate FINISHED messages
cause a KeyError on `del self.active_tests[test_key]` that kills the
entire session.

Changes:

- Add idempotency guard in _handle_finished: check active_tests and
  finished_tests before processing. If the test was already finished,
  log elapsed time since original event_time. Return early instead of
  crashing with KeyError.

- Track last 100 received messages in a deque ring buffer. On fatal
  exception in the main event loop, dump the recent message history
  (test_key, event_type, event_id, message_id).

- Add FATAL_ERROR_TAG to the exception handler log line.

- Log a warning when event processing exceeds SLOW_EVENT_THRESHOLD_S
  (5s), using try/finally around _handle() so it fires on both
  success and exception paths.

- Increase Sender.REQUEST_TIMEOUT_MS from 3s to 35s so the first
  send attempt outlasts the 30s join timeout. This is needed because
  non-FINISHED events are not idempotent and cannot safely be retried.
@travisdowns travisdowns force-pushed the td-duplicate-finished-handling branch from 734900b to 17b0fe0 Compare March 3, 2026 16:01
Allow overriding REQUEST_TIMEOUT_MS via constructor param, defaulting
to the class constant (35s). This lets the check_timeout test pass a
small timeout (100ms) so the spawned Sender subprocess finishes in
~0.5s instead of ~175s (5 retries * 35s).
@travisdowns travisdowns requested review from a team and gousteris March 3, 2026 16:11
@travisdowns
Copy link
Copy Markdown
Member Author

@gousteris take a peek if you can.

Comment thread ducktape/tests/runner.py
f"test_id={event.get('test_id')}, test_index={event.get('test_index')}")
except Exception as e:
err_str = "Exception receiving message: %s: %s, active_tests: \n %s \n" % (str(type(e)), str(e), self.active_tests_debug())
err_str = "%s Exception receiving message: %s: %s, active_tests: \n %s \n" % (FATAL_ERROR_TAG, str(type(e)), str(e), self.active_tests_debug())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should CI on this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gousteris sorry I didn't understand the comment. Are you saying to do a CI run before checking in?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I meant if CI should fail when a fatal error is raised

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it should and it will as there is a raise below in this same exception handler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants